ABSTRACT

We present the newest implementation of the LINGSTAT
machine-aided translation system. The most significant
change from earlier versions is a new set of modules that
produce a draft translation of the document for the user to
refer to or modify. This paper describes these modules, with
special emphasis on an automatically trained lexicalized
grammar used in the parsing module. Some preliminary results
from the January 1994 ARPA evaluation are reported.

1.  INTRODUCTION 

LINGSTAT is an interactive machine-aided translation
system designed to increase the productivity of a
translator. It is aimed both at experienced users whose goal
is high quality translation, and inexperienced users with
little knowledge of the source whose goal is simply to
extract information from foreign language text. (For an
introduction to the basic structure of LINGSTAT, see [1].)

The first problem to be studied is Japanese to English
translation with an emphasis on text from the domain
of mergers and acquisitions, although recent evaluations
have included general newspaper text as well. Work is
also progressing on a Spanish to English system. The
approach described below represents the current state of
the Japanese system, and will be applied with minimal
changes to Spanish.

Due to the special difficulties presented by the Japanese
writing system, previous versions of LINGSTAT have
focused on developing tools for the lexical analysis of
Japanese (such as tokenization of the Japanese character
stream, morphological analysis, and katakana
transliteration), and on providing the user access to lexical
information (such as pronunciations, glosses, and definitions)
via online lookup tools. In addition, a simple parser was
incorporated to identify modifying phrases. No translation
of the document was provided. Instead, the user
used the results of the above analyses and the online
tools to construct a translation.

In the newest version of LINGSTAT, the user is provided
with a draft translation of the source document.
For a source language similar to English, a starting point
for such a draft might be a word-for-word translation,
but because Japanese word order and sentence structure
are so different from English, a more general framework
has been constructed. The translation process in LINGSTAT
consists of the following steps:

•  Tokenization  and  morphological  analysis 

•  Parsing 

•  Rearrangement  of  the  source  into  English  order 

•  Annotation  and  selection  of  glosses  via  an  English 
language  model 

These modules are described in Section 2 below. Section 3
gives some preliminary results from the January
1994 evaluation, and Section 4 discusses some plans for
future improvements.

2.  IMPLEMENTATION 

Tokenization  /  de-inflection 

In LINGSTAT, “tokenization” refers to the process of
breaking a source document into a sequence of root
words tagged, if necessary, with inflection information.
For most languages, the tokenizer is basically an engine
that oversees the de-inflection of source words into
root forms. For languages like Japanese, written without
spaces, the tokenizer also has the job of segmenting the
source.

To  segment  Japanese,  the  LINGSTAT  tokenizer  uses  a 
probabilistic  dynamic  programming  algorithm  to  break 
up  the  character  stream  into  the  sequence  of  words  that 
maximizes  the  product  of  word  unigram  probabilities,  as 
supplied  from  a  list  of  300,000  words.  Inflected  forms  are 
recognized  during  tokenization  by  a  de-inflector  module. 
This  module  has  a  language-independent  engine  driven 
by  a  language-specific  de-inflection  table.  (More  details 
on  the  function  of  these  components  can  be  found  in  [1].) 
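The segmentation step described above can be illustrated with a
minimal sketch. It assumes a dictionary of unigram probabilities
(here a toy table over Latin letters, not LINGSTAT's 300,000-word
list) and a hypothetical maximum word length; the function name
`segment` and its parameters are illustrative and not taken from
the system itself.

```python
import math


def segment(text, unigram_prob, max_word_len=10):
    """Return the word sequence that maximizes the product of unigram
    probabilities, or None if no segmentation covers the whole string."""
    n = len(text)
    # best[i] = (best log-probability for text[:i], start index of last word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            p = unigram_prob.get(text[start:end])
            if p is None or best[start][0] == -math.inf:
                continue
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None
    # Follow the back-pointers from the end of the string.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy probabilities (assumed values, for illustration only).
toy_probs = {"a": 0.05, "b": 0.05, "ab": 0.2, "c": 0.1, "bc": 0.001}
print(segment("abc", toy_probs))  # ['ab', 'c']
```

Working in log space avoids underflow on long character streams;
in this sketch a de-inflector would be consulted at the point
where the dictionary lookup is made.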

There have been two improvements in the tokenizer/de-inflector
module in the newer versions of the system,
made possible by the introduction of part of speech
information into the word list. The first is an extra check
on the validity of suggested de-inflections by demanding
consistency between the inflection and the part of
speech of the proposed root. This has cleanly eliminated
a number of spurious de-inflections that were previously
handled in a more ad-hoc fashion. The second improvement,
motivated more by plans to move on to Spanish, is
to stop the tokenizer from attempting to uniquely specify
the de-inflection path (and now, part of speech) for
each token it finds. As an example of the problem this
addresses, consider the two de-inflections of the Spanish
word ayudas:

ayudas → ayuda (help, aid)

ayudas → ayudar (to help, to aid)

The  original  tokenizer  made  a  choice  between  the  noun 
and  verb  de-inflection  based  on  the  unigram  frequency 
of  the  root.  The  new  tokenizer  still  finds  all  allowed 
possibilities,  but  now  simply  passes  them  to  the  parser, 
which  is  better  equipped  to  resolve  the  ambiguity. 
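A small sketch may make the two tokenizer changes concrete. The
de-inflection table, the word list, and the inflection labels
below are hypothetical stand-ins (the actual table format is
described in [1], not here); the point is only that every
de-inflection consistent with the part of speech of its root is
returned, rather than a single choice.

```python
# Hypothetical de-inflection table:
# (suffix to strip, replacement, inflection label, required POS of the root)
DEINFLECTION_TABLE = [
    ("s", "", "plural", "noun"),
    ("as", "ar", "2nd-person-singular-present", "verb"),
]

# Hypothetical word list annotated with parts of speech.
WORD_LIST = {
    "ayuda": {"noun"},
    "ayudar": {"verb"},
    "ante": {"preposition"},
    "antes": {"adverb"},
}


def deinflect(token):
    """Return every (root, part of speech, inflection) reading of token
    whose inflection is consistent with the part of speech of the root."""
    candidates = [(token, pos, None) for pos in sorted(WORD_LIST.get(token, ()))]
    for suffix, replacement, label, required_pos in DEINFLECTION_TABLE:
        if not token.endswith(suffix):
            continue
        root = token[: len(token) - len(suffix)] + replacement
        # Consistency check: the root must carry the POS the inflection needs.
        if required_pos in WORD_LIST.get(root, ()):
            candidates.append((root, required_pos, label))
    return candidates


print(deinflect("ayudas"))  # both the noun and the verb reading
print(deinflect("antes"))   # "ante" + plural is rejected: not a noun here
```

All surviving candidates would then be handed to the parser,
which chooses among them on the basis of parse probability.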

Parsing 

The parser in LINGSTAT has two roles. In the interactive
component, information about modifying phrases
is extracted from the parse and presented to the user as
an aid to understanding the structure of each Japanese
sentence. In the automatic component, the parse is the
basis for the rearrangement of the Japanese sentence into
English.

Because it is a long-term goal to have a system that can
be quickly adapted to new domains and languages, a
high priority is placed on developing parsing techniques
that are capable of extracting some information
automatically through training on new sources of text, thus
minimizing the amount of human effort. In the current
system, this has led to a two-stage parsing process.
The first stage implements a coarse probabilistic context-free
grammar of a few hundred human-supplied rules
acting on parts of speech. Because of this coarseness,
some parsing ambiguities remain to be resolved by the
second-stage parser, which implements a simple, lexicalized,
probabilistic context-free grammar trained on word
co-occurrences in unlabeled Japanese sentences without
human input.

Context-free parser. The first-stage parse is done using
a standard probabilistic context-free grammar acting
on about 50 parts of speech. Any ambiguities in part of
speech assignments or de-inflection paths passed by the
tokenizer/de-inflector are resolved based on the probability
of possible parses. The grammar is allowed to
contain unitary and null productions, which impose an
ordering on the summation over rules that takes place
during training; because there are currently only a few
hundred rules, this ordering is checked by hand. The
grammar can be trained with either the Inside-Outside
[2] or Viterbi [3] algorithm.

It is essential that the parser return a parse, even a bad
one, for subsequent processing. Therefore special,
low-probability “junk” rules have been introduced to handle
unanticipated constructions. These junk rules affect the
generation of terminal symbols and take the following
form: for each rule in which a non-terminal generates a
particular terminal, a rule is added permitting the same
non-terminal to generate any other terminal with a small
probability. This allows the grammar to force the terminal
string into a sequence that has a recognizable parse,
but at a cost high enough that any parse without
such coercion will be favored. One advantage of this
approach is that the grammar can compensate for missing
or mislabeled data. Consider the fragment

the_det  large_adv  dog_noun

in  which  the  adjective  large  has  been  mislabeled  as  an 
adverb.  The  junk  rule  permits  the  grammar  to  change  its 
part  of  speech  to  something  more  appropriate  provided 
no  other  sensible  parse  can  be  found. 
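The construction of the junk rules can be sketched as follows.
The representation of lexical rules as a mapping from
non-terminal to a distribution over terminals, the default value
of `junk_prob`, and the renormalization of the original rules
are all assumptions made for this illustration; the paper states
only that the added rules carry a small probability.

```python
def add_junk_rules(lexical_rules, terminals, junk_prob=1e-6):
    """Augment terminal-generating rules with low-probability junk rules.

    lexical_rules maps each non-terminal to {terminal: probability}.
    For every such non-terminal, each terminal it cannot already
    generate receives probability junk_prob; the original rules are
    scaled down so that each distribution still sums to one (an
    assumption of this sketch).
    """
    augmented = {}
    for nonterminal, dist in lexical_rules.items():
        missing = [t for t in terminals if t not in dist]
        junk_mass = junk_prob * len(missing)
        new_dist = {t: p * (1.0 - junk_mass) for t, p in dist.items()}
        for t in missing:
            new_dist[t] = junk_prob
        augmented[nonterminal] = new_dist
    return augmented


# Hypothetical part-of-speech terminals and lexical rules.
terminals = ["det", "adj", "adv", "noun"]
rules = {"ADJ": {"adj": 1.0}, "DET": {"det": 1.0}, "N": {"noun": 1.0}}
augmented = add_junk_rules(rules, terminals)
# ADJ may now generate "adv", so "the_det large_adv dog_noun" can still be
# parsed as determiner, adjective, noun, at a large probability cost.
print(augmented["ADJ"]["adv"])
```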

In  principle,  the  probability  of  invoking  the  junk  rule 
could  be  trained  with  the  other  rules  in  the  grammar  (the 
example  above  suggests  that  it  might  be  advantageous 
to  do  so).  Currently  this  is  not  being  done,  based  on  the 
observation  that  an  invocation  of  the  junk  rule  is  more 
likely  an  indication  of  a  deficiency  in  the  grammar  than 
a  useful  correction  to  the  data. 

Lexicalized parser. The grammar implemented by the
context-free parser is not fine enough to properly resolve
certain kinds of ambiguity, such as the correct attachment
of prepositional phrases or noun modifiers. These
attachment problems are handled by a second parser,
which does a top-down rescoring of certain probabilities
computed in the first stage. Currently this rescoring
is used to fine-tune attachments of particle phrases in
Japanese sentences.

The second parser makes use of a second probabilistic
grammar, one whose basic elements are the words themselves,
and whose data consist of the probabilities of each
word in the vocabulary to be generated in the context
of any other word. Like a bigram language model, these
probabilities can be trained on word co-occurrences in
unlabeled sentences, but unlike bigrams, the grammar
can learn about associations between words in a sentence
regardless of their separation.

This very simple context-free grammar can be described
as follows. To each word in the vocabulary we associate
a terminal symbol $w$ (the word itself) and a non-terminal
symbol $A_w$. The grammar consists of the following two
kinds of rules:

$$A_{w_1} \to A_{w_2}\, w_2\, A_{w_2}\, A_{w_1} , \tag{1a}$$

$$A_{w_1} \to \phi , \tag{1b}$$

where $\phi$ represents the null production. In addition, we
introduce a sentence start symbol $A_0$ with the production

$$A_0 \to A_w\, w\, A_w . \tag{2}$$

The probability of invoking a particular rule depends
only on the word associated with the generating
non-terminal and the terminal word in the production. The
probabilities for (1a) and (1b) can therefore be written
$p(w_1 \to w_2)$ and $p(w_1 \to \phi)$, respectively, and these
satisfy

$$p(w_1 \to \phi) + \sum_{w_2} p(w_1 \to w_2) = 1 .$$

For the start symbol, the probabilities are $p(0 \to w)$ and
satisfy

$$\sum_{w} p(0 \to w) = 1 .$$

There is no null production for the start symbol.

Roughly speaking, this grammar generates a sentence in
the following manner. The start symbol first generates
some word in the sentence. This word then generates
some number of words to its left and right, which in
turn generate other words to their left and right. From
the form of the grammar it can be deduced that these
generations are “local,” in the sense that if $w_1$ generates
$w_2$ on its right, $w_2$ is not allowed to generate any word
to the left of $w_1$ (and similarly for $w_1$ generating $w_2$ on
its left). The process continues in a cascading fashion
until the whole sentence has been generated. The fertility
of a particular word $w$ (i.e., the number of words it
will typically generate) is determined by the probability
$p(w \to \phi)$, as can be seen from examining the productions
(1): a non-terminal $A_w$ will continue to produce
words through rule (1a) via tail recursion until rule (1b)
is invoked.
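The generative process just described can be made concrete
with a short sampler. This is only a sketch: the probability
tables `P_START` and `P` are toy values invented for the
example, `None` is used here to stand for the null production,
and the function names are illustrative.

```python
import random

NULL = None  # stands for the null production

# Toy parameters (assumed values). P[w][w2] plays the role of p(w -> w2)
# and P[w][NULL] the role of p(w -> null); each distribution sums to one.
P_START = {"car": 1.0}
P = {
    "car": {NULL: 0.6, "fast": 0.2, "the": 0.2},
    "fast": {NULL: 0.9, "the": 0.1},
    "the": {NULL: 1.0},
}


def expand(word, p, rng):
    """Expand one non-terminal A_word into the list of words it yields."""
    out = []
    while True:
        choices, weights = zip(*p[word].items())
        w2 = rng.choices(choices, weights=weights)[0]
        if w2 is NULL:
            return out  # null rule: this non-terminal stops generating
        # Generating rule: A_word -> A_w2 w2 A_w2 A_word. The trailing
        # A_word (tail recursion) is handled by the enclosing loop.
        out += expand(w2, p, rng) + [w2] + expand(w2, p, rng)


def generate(p_start, p, rng):
    """Sample a sentence: the start symbol picks a word, which then
    generates words to its left and to its right."""
    choices, weights = zip(*p_start.items())
    w = rng.choices(choices, weights=weights)[0]
    return expand(w, p, rng) + [w] + expand(w, p, rng)


print(" ".join(generate(P_START, P, random.Random(0))))
```

With these toy tables the word "the" always draws the null
production, so it never generates other words, which mirrors
the fertility behavior described above.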

Although this grammar has the same type and number
of parameters as a bigram model, here they have a very
different interpretation: they measure the probability of
one word to generate another anywhere in the sentence,
subject only to the constraints imposed by the generation
process described above. Thus an association between
two words that might typically appear together,
such as fast and car, will be recognized even if another
word might occasionally intervene, such as red. Another
feature is that words with the most predictive power in
a sentence tend to generate words with less predictive
power, which has the consequence that words like the
tend to generate no words at all. This is an improvement
over a bigram model in which the is required to
select a succeeding word from a distribution that is
essentially flat across a large portion of the vocabulary.

This grammar shares the appealing feature of n-gram
models that its parameters can be trained on unlabeled
text (consisting of whole sentences). In this case, however,
the training procedure is iterative: a modification
of the Inside-Outside algorithm that is of order $N^4$ in
the sentence length.¹ The iteration starts from a flat
distribution, with co-occurrences of words within sentences
leading to enhanced probabilities for some words
to generate others.

The $N^4$ algorithm actually applies to a slightly different
(but generatively equivalent) grammar than the one defined
by rules (1) and (2). To implement this algorithm,
we first replace rule (1a) by

$$A_{w_1} \to A_{w_2}\, w_2\, A_{w_2}\, A_{w_1} \quad (w_2 \text{ to the left of } w_1) ,$$

$$A_{w_1} \to A_{w_1}\, A_{w_2}\, w_2\, A_{w_2} \quad (w_2 \text{ to the right of } w_1) ,$$

where the probability of both rules is the same and given
by $p(w_1 \to w_2)$. The only difference between this and
rule (1a) is that when $A_{w_1}$ generates multiple words to
the right of $w_1$, they are generated right to left instead of
left to right.

As an example of how the $N^4$ dependence arises, consider
the inside calculation for this model. For a sentence
$w_1 \ldots w_N$, the quantities of interest for the inside pass
are the probabilities $I(A_{w_i} \to w_j \ldots w_{i-1})$ for $j < i$ and
$I(A_{w_i} \to w_{i+1} \ldots w_j)$ for $j > i$. These may be calculated
recursively by the following formulae:

$$\begin{aligned}
I(A_{w_i} \to w_j \ldots w_{i-1}) = \sum_{k=j}^{i-1} \sum_{l=k}^{i-1} \; & p(w_i \to w_k) \\
& \times I(A_{w_k} \to w_j \ldots w_{k-1})\, I(A_{w_k} \to w_{k+1} \ldots w_l) \\
& \times I(A_{w_i} \to w_{l+1} \ldots w_{i-1}) ,
\end{aligned} \tag{3a}$$

$$\begin{aligned}
I(A_{w_i} \to w_{i+1} \ldots w_j) = \sum_{k=i+1}^{j} \sum_{l=i}^{k-1} \; & p(w_i \to w_k) \\
& \times I(A_{w_k} \to w_{k+1} \ldots w_j)\, I(A_{w_k} \to w_{l+1} \ldots w_{k-1}) \\
& \times I(A_{w_i} \to w_{i+1} \ldots w_l) ,
\end{aligned} \tag{3b}$$

where the “negative length” string $w_i \ldots w_{i-1}$ is understood
to represent the null production $\phi$. The recursion
is initialized by

$$I(A_{w_i} \to \phi) = p(w_i \to \phi) .$$

¹The authors would like to thank Joshua Goodman for developing
the $N^4$ procedure, a notable improvement over previous
implementations.

The above computations involve a double sum and are
therefore of order $N^2$, and there are order $N^2$ probabilities
$I(A_{w_i} \to w_j \ldots w_{i-1})$ and $I(A_{w_i} \to w_{i+1} \ldots w_j)$, for
a total of $N^4$. (For the Viterbi calculation, one simply
selects the largest contribution from the right hand side
of equations (3a) and (3b) instead of doing the double
sum.)
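The inside recursion of equations (3a) and (3b) can be written
down almost directly. The sketch below is an illustration under
stated assumptions, not the system's implementation: indices are
zero-based, `None` stands for the null production, the tables
`p_start` and `p` are toy values, and the sentence probability is
assumed to be the sum over head words $w_i$ of $p(0 \to w_i)$ times
the inside probabilities of the material to its left and right,
following rule (2).

```python
from functools import lru_cache

NULL = None  # stands for the null production


def sentence_probability(words, p_start, p):
    """Inside probability of a sentence under the lexicalized grammar."""
    n = len(words)

    def prob(head, child):
        return p.get(head, {}).get(child, 0.0)

    @lru_cache(maxsize=None)
    def left(i, j):
        """I(A_{w_i} -> w_j ... w_{i-1}); j == i denotes the null string."""
        if j == i:
            return prob(words[i], NULL)
        total = 0.0
        for k in range(j, i):
            for l in range(k, i):
                total += (prob(words[i], words[k]) * left(k, j)
                          * right(k, l) * left(i, l + 1))
        return total

    @lru_cache(maxsize=None)
    def right(i, j):
        """I(A_{w_i} -> w_{i+1} ... w_j); j == i denotes the null string."""
        if j == i:
            return prob(words[i], NULL)
        total = 0.0
        for k in range(i + 1, j + 1):
            for l in range(i, k):
                total += (prob(words[i], words[k]) * right(k, j)
                          * left(k, l + 1) * right(i, l))
        return total

    return sum(p_start.get(words[i], 0.0) * left(i, 0) * right(i, n - 1)
               for i in range(n))


# Toy parameters (assumed values): "a" may generate "b"; "b" generates nothing.
p_start = {"a": 0.5, "b": 0.5}
p = {"a": {NULL: 0.5, "b": 0.5}, "b": {NULL: 1.0}}
print(sentence_probability(["a", "b"], p_start, p))  # 0.0625
```

Each of the order $N^2$ memoized entries costs a double loop, which
is where the overall $N^4$ behavior comes from. Replacing the sums
by maxima would give the Viterbi variant mentioned above.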

It is important to note that despite the $N^4$ behavior, this
grammar is in general faster than context-free parsing,
which is computationally of order $N^3$. This is because
the compute time for context-free parsing also includes
a factor proportional to the number of rules in the grammar,
which even in simple cases can be in the hundreds.
There is no such factor in the computation for this
lexicalized grammar; it is effectively replaced by another
power of $N$, which is much smaller.

To see how the probabilities $p(w_1 \to w_2)$ converge, this
model was run through ten iterations of training on
approximately 100,000 sentences of ten words or less from
the English half of the Canadian Hansard corpus. Some
examples of these probabilities follow:

the            U.                tariffs

.91 $\phi$     .52 $\phi$        .U $\phi$
               .26 S.            .09 agreement
               .14 agreement     .08 and
                                 .08 general
                                 .08 on

As expected, the trains strongly to generate the null symbol.
The token U. has a strong tendency to generate
S. for obvious reasons; that it also generates agreement
is a consequence of the frequent discussion in the corpus
of the U. S. free trade agreement. This is an example of
how the model will find associations between separated
words that even a trigram model will not see. The distribution
associated with tariffs arises from parliamentary
debate on the general agreement on tariffs and trade.

The simple grammar described above can be considered
the starting point for a class of more complex models.
One obvious extension is to train the probability distributions
for generating to the left and right separately.
This corresponds to implementing the grammar

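$$A^{L}_{w_1} \to A^{L}_{w_2}\, w_2\, A^{R}_{w_2}\, A^{L}_{w_1} , \qquad A^{L}_{w_1} \to \phi , \tag{4a}$$

$$A^{R}_{w_1} \to A^{L}_{w_2}\, w_2\, A^{R}_{w_2}\, A^{R}_{w_1} , \qquad A^{R}_{w_1} \to \phi . \tag{4b}$$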

Training this grammar on the same text as the original
model yields the left probabilities:

tariffs

.35 $\phi$
.14 agreement
.12 general
.12 on

Again, the tends to generate a null. Like most nouns,
U. has learned to generate a the to its left, and the left
distribution for tariffs includes only those words found
typically on its left. The right probabilities for the same
words are:

the            U.                tariffs

.90 $\phi$     .36 $\phi$        .52 $\phi$
               .37 S.            .18 and
               .19 agreement     .17 trade
               .07 free

These are also consistent with the results from the original
model.

Rearrangement 

The next step in LINGSTAT’s translation method is
a transfer of the parse of each Japanese sentence into
a corresponding English parse, giving an English word
ordering. This is accomplished through the use of
English rewrite rules encoded in the Japanese grammar.
Through this encoding, each non-terminal in the
Japanese grammar corresponds to a non-terminal in an
implied English grammar. The rewrite process just consists
of taking the Japanese parse and expanding in this
English grammar. As this expansion proceeds, Japanese
constructs that are not translated (certain particles, for
example) are removed, and tokens for English constructs
not represented in the Japanese (such as articles) are
introduced.
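One way to picture the rewrite process is as a template attached
to each Japanese production. The sketch below is purely
illustrative: the rule table `REWRITE`, the non-terminal names,
the `<article>` token, and the romanized example sentence are all
hypothetical and do not come from the LINGSTAT grammar.

```python
# Hypothetical rewrite table: Japanese production -> English-side template.
# Integers index the Japanese children (reordering or dropping them);
# strings are tokens for English constructs absent from the Japanese.
REWRITE = {
    ("S", ("NP_topic", "VP")): [0, 1],
    ("NP_topic", ("NP", "WA")): [0],      # topic particle is not translated
    ("VP", ("NP_obj", "V")): [1, 0],      # English puts the verb first
    ("NP_obj", ("NP", "WO")): [0],        # object particle is not translated
    ("NP", ("N",)): ["<article>", 0],     # introduce an article token
}


def rearrange(tree):
    """Expand a Japanese parse tree into an English-ordered token list.

    A tree is (label, children); a leaf is (part of speech, word)."""
    label, children = tree
    if isinstance(children, str):
        return [children]
    key = (label, tuple(child[0] for child in children))
    # Productions without a template keep the Japanese order.
    template = REWRITE.get(key, list(range(len(children))))
    out = []
    for item in template:
        if isinstance(item, int):
            out += rearrange(children[item])
        else:
            out.append(item)
    return out


# "inu wa hone wo tabeta" (roughly: dog TOPIC bone OBJECT ate)
tree = ("S", [
    ("NP_topic", [("NP", [("N", "inu")]), ("WA", "wa")]),
    ("VP", [("NP_obj", [("NP", [("N", "hone")]), ("WO", "wo")]),
            ("V", "tabeta")]),
])
print(rearrange(tree))
```

The output is still made of Japanese words and placeholder
tokens; choosing English glosses for them is a separate step.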

Annotation/language  model 

The Japanese words in the reordered sentences are annotated
with (possibly several) candidate English glosses,
supplied from an electronic dictionary compiled from
various sources. Numbers are translated directly, and
katakana tokens (which are usually borrowed foreign
words) are transliterated into English. Tokens introduced
in the rearrangement step are also glossed; the
token indicating an English article is multiply glossed as
the, a, an, and null (which expands to an empty word).

Inflected Japanese words are glossed by first glossing the
root, then applying an English version of the Japanese
inflection to each candidate. This is made difficult by the
poor correspondence between Japanese and English
inflections: English is inflected for person and number, for
example, while in Japanese there are inflections for such
constructions as the causative, which require non-local
changes in the corresponding English. Japanese inflections
also often consist of multiple steps, which means
that the English inflections must be compounded. For
example, to inflect the verb to walk into the past
desiderative involves the two step transformation,

to walk → to want to walk → wanted to walk .

This procedure can produce some unusual results when
the number of inflection steps is greater than two.
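The compounding of English inflections can be sketched as a
chain of string transformations. The two step functions below
are hypothetical and handle only infinitive phrases beginning
with "to" and regular past tenses; they are meant to show the
composition, not the system's inflection rules.

```python
def desiderative(phrase):
    """'to walk' -> 'to want to walk'."""
    return "to want " + phrase


def past(phrase):
    """'to want to walk' -> 'wanted to walk' (regular verbs only)."""
    words = phrase.split()
    if len(words) < 2 or words[0] != "to":
        raise ValueError("expected an infinitive phrase starting with 'to'")
    verb = words[1]
    past_form = verb + ("d" if verb.endswith("e") else "ed")
    return " ".join([past_form] + words[2:])


def inflect(phrase, steps):
    """Apply the English counterpart of each Japanese inflection step in turn."""
    for step in steps:
        phrase = step(phrase)
    return phrase


print(inflect("to walk", [desiderative, past]))  # wanted to walk
```

Because each step sees only the output of the previous one,
longer chains can easily yield awkward English, which is
consistent with the remark above about more than two steps.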

The final step in the translation process is to apply an
English language model to select the best gloss from
among the many candidates for each word. In the current
system this is done with a trigram model, which
makes the choices that maximize the average probability
per word. The trigram model used was trained on
Wall Street Journal and so has a business bias, partially
reflecting the bias of the evaluation texts.
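Gloss selection can be illustrated with a toy example. The
sketch below makes several assumptions not stated in the
paper: "average probability per word" is interpreted as the
average log probability per emitted word, unseen trigrams get
a fixed floor probability, the search is brute force over all
combinations (practical only for very short sentences), and
the trigram values are invented for the example.

```python
import itertools
import math


def select_glosses(candidates, trigram_prob, floor=1e-6):
    """Pick one gloss per position so as to maximize the average
    log trigram probability per emitted word.

    candidates is a list of gloss lists; the empty string is the null gloss."""
    best, best_score = None, -math.inf
    for choice in itertools.product(*candidates):
        words = [w for w in choice if w]  # drop null glosses
        if not words:
            continue
        padded = ["<s>", "<s>"] + words
        logp = sum(
            math.log(trigram_prob.get(tuple(padded[i - 2:i + 1]), floor))
            for i in range(2, len(padded))
        )
        score = logp / len(words)  # per-word normalization
        if score > best_score:
            best, best_score = words, score
    return best


# Toy trigram probabilities (assumed values).
trigram_prob = {
    ("<s>", "<s>", "the"): 0.2,
    ("<s>", "the", "dog"): 0.1,
    ("the", "dog", "ate"): 0.05,
}
candidates = [["the", "a", "an", ""], ["dog", "hound"], ["ate", "consumed"]]
print(select_glosses(candidates, trigram_prob))  # ['the', 'dog', 'ate']
```

Normalizing by the number of emitted words matters here because
a null gloss shortens the sentence, so raw products of
probabilities would not be comparable across candidates.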

3.  RESULTS 

The January 1994 ARPA machine translation evaluation
has recently been completed. In this test, Dragon
used the same translators as in the May 1993 evaluation
and provided them with essentially the same interface
and online tools. The difference in this evaluation was
that the translators were also provided an automatically
generated English translation of the Japanese document
as a first draft. Manual and machine-assisted translation
times were measured, and the automatic output was also
submitted for separate evaluation.

Preliminary  timing  results  show  a  speedup  by  a  factor  of 
2.4  in  machine-assisted  vs.  manual  translation.  Because 
we  were  using  the  May  1993  translators,  this  result  may 
be  compared  to  the  May  1993  result;  it  is  essentially 
unchanged.  This  suggests  that  the  draft  translation  was 
of  no  significant  help  to  the  translators  in  this  evaluation, 
probably  because  the  quality  of  automatic  output  is  not 
high  enough  to  be  relied  upon. 

A quality measurement of the automatic output is not
yet available, but we offer one example of a sample
translation from the current system. For the following
correctly glossed Japanese sentence,

(America) (investment bank) NO (Wertheim)
(Schroder) WA , (Mitsubishi Trust and Banking
Corporation) (to) (same company) NO (stock) NO
(4.9%) NO (sell) KOTO WO (decided)

LINGSTAT produced

waaxuhaimu shuroodaa of the America investment
bank decided to sell off 4.9% of the shares of the
same company to Mitsubishi Trust and Banking
Corporation

Even this simple sentence demonstrates the large amount
of rearrangement necessary to render the Japanese into
English. This effort is not without errors; a correct
translation shows that the word meaning same company was
mishandled, as was the modifier of Wertheim Schroder:

The American investment bank Wertheim Schroder
has decided to sell 4.9% of its stock to the Mitsubishi
Trust and Banking Corporation

This  sentence  is  less  complex  than  is  typical  in  a 
Japanese  newspaper  article,  and  therefore  LINGSTAT’s 
performance  in  this  case  is  not  representative. 

4.  FUTURE  PLANS 

The steps that have the most effect on the quality of the
final output translation (at least for Japanese) are the
parser and gloss selection modules. The parser in particular
is crucial, since it initiates a global rearrangement
of the sentence into a sensible English order, so a parsing
mistake will often render a sentence unintelligible.

The improvements contemplated for the parsing module
include more hand work on the coarse context-free
grammar to provide more accurate parses, and a general
speedup to allow more extensive training. A faster
parser would also allow the merging of the two grammars
so that they could be trained simultaneously. Attempts
to do this have so far resulted in an unacceptable increase
in training and parsing time due to the complexity of the
algorithm.

The language model used to select glosses in the final
translation step must be improved to have more global
control. Common mistakes made by the current model
include inconsistent glossing of a recurring word and
virtually no notion of topic or domain (except on business
subjects). Both of these problems are the result of using
a trigram language model, whose context is restricted to
the two preceding words.

The newest version of the system must be ported to
Spanish for the next evaluation, scheduled for June. This
will require improvements to the Spanish dictionary and
de-inflector, an update of the Spanish grammar from the
older Spanish system, a lexicalized grammar trained on
Spanish text, and Spanish rewrite rules. We intend to
use the parallel Spanish-English component of the UN
data to provide gloss information.